In this lab, we will carry out some point pattern data analysis. We will cover constructing point pattern objects, basic data exploration and building simple models.

Before starting the lab, you will need to set up a new folder for your working directory. Go to your geog6000 folder now and create a new folder for today’s class called lab14. The following files will be used in this lab, all available on Canvas:

  • Locations of trees in tropical rainforest (bei.csv) and associated slope values (beislope.csv)
  • Zipped shapefile containing locations of redwood trees in a plot in Northern California: redwood.zip
  • Locations of six species of tree in the Lansing Forest, Michigan and associated shapefile of study area: lansing.zip

You will need to download these files from Canvas, and move them from your Downloads folder to the datafiles folder that you made previously. Make sure to unzip the zip files so that R can access the content. Note that on Windows, you will need to right-click on the file and select ‘Extract files’.

Now start RStudio and change the working directory to lab14. As a reminder, you can do this by going to the [Session] menu in RStudio, then [Change working directory]. This will open a file browser that you can use to browse through your computer and find the folder.

You will need two packages to carry out all the steps listed here: sf and spatstat, so make sure these are installed these before starting.

pkgs <- c("spatstat",
          "sf")

install.packages(pkgs)

With all the examples given, it is important to not just type in the code, but to try changing the parameters and re-running the functions multiple times to get an idea of their effect. Help on the parameters can be obtained by typing help(functionname) or ?functionname.

Point pattern data

For most of this lab we will use the ‘BEI’ data set, in the file bei.csv. This is a set of locations of 3605 trees in a tropical rainforest taken from Barro Colorado Island, Panama. All analysis of this data will be done using functions taken from the spatstat package, so start by loading this package.

library(spatstat)

Prior to all analysis, the locations must be read in and stored in a point pattern object (ppp). These objects contain, at minimum, the locations of the objects and a description of the window delimiting the study area (either as a box or a polygon). To create this for the BEI dataset, we (1) read in the data; (2) check the range of coordinates (with the summary() function); (3) create a bounding window using the owin() function; (3) create the ppp object with the X and Y coordinates and the window:

bei <- read.csv("../datafiles/bei.csv")

summary(bei)
##        x               y        
##  Min.   :  0.1   Min.   :  0.1  
##  1st Qu.:151.9   1st Qu.:103.5  
##  Median :342.3   Median :282.6  
##  Mean   :433.8   Mean   :263.5  
##  3rd Qu.:719.0   3rd Qu.:416.1  
##  Max.   :998.9   Max.   :499.9
bei.owin <- owin(xrange = c(0,1000), yrange = c(0,500))

bei.ppp <- ppp(bei$x, bei$y, window = bei.owin)

We can use plot() and summary() functions; these will adapt to the point process object.

plot(bei.ppp, pch = 16, cols = 'gray')

summary(bei.ppp)
## Planar point pattern:  3604 points
## Average intensity 0.007208 points per square unit
## 
## Coordinates are given to 1 decimal place
## i.e. rounded to the nearest multiple of 0.1 units
## 
## Window: rectangle = [0, 1000] x [0, 500] units
## Window area = 5e+05 square units

Note that the summary() function returns the average intensity or number of objects (trees) per unit area.

Quadrat counts and tests

As the distribution of the trees is clearly non-uniform or inhomogenous, we require different methods to assess how this distribution varies over space. The simplest method is to divide the area into quadrats and count the number of trees per quadrat. The parameters nx and ny define the number of quadrats along each coordinate.

bei.qc <- quadratcount(bei.ppp, nx = 6, ny = 3)

plot(bei.qc)

We can test for a significant departure from a uniform or homogenous distribution by using the function quadrat.test(). This compares the observed number against an expected number if the objects were distributed uniformly (number of trees / number of quadrats). The differences are used in a Chi-squared test, with the null hypothesis of homogeneity or equal distribution among the transects.

bei.qt <- quadrat.test(bei.ppp, nx = 6, ny = 3)

bei.qt
## 
##  Chi-squared test of CSR using quadrat counts
## 
## data:  bei.ppp
## X2 = 2207.8, df = 17, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 
## Quadrats: 6 by 3 grid of tiles
plot(bei.qt, cex=0.8)

The \(p\)-value obtained suggests that we can reject the null hypothesis, and accept the alternative, that there is a inhomogenous distribution, which implies that there is some spatial process controlling the distribution of the trees.

As an alternative to the simple quadrat approach, we can use a covariate to see if there is an association between values of the covariate and the object distribution. Here we use gridded slope values for the study area, contained in the file beislope.csv, read into R as a pixel image object. In order for this to be used, we (1) read in the gridded dataset; (2) create a new window for the gridded data (on a 10m grid); (3) create the pixel object using the as.im() function. Note that we define the window size by hand by specifying minimum and maximum x and y coordinates of the region.

bei.slope <- read.csv("../datafiles/beislope.csv", header = FALSE)

bei.slope.owin <- owin(xrange = c(-2.5, 1002.5), yrange = c(-2.5, 502.5))

bei.slope <- as.im(as.matrix(bei.slope), W = bei.slope.owin)

plot(bei.slope)

plot(bei.ppp, pch = 16, cex = 0.7, add = TRUE)

To use the quadrat method with this data, we need to convert the slope into slope classes, then check for association with each class. We start be creating the classes: (1) calculate quartiles of slope values for the class boundaries; (2) use the cut() function to assign slope values to one of the four classes; (3) create a tessellation based on the classes, which will be used to identify which class a object belongs to (using the tess() function):

b <- quantile(bei.slope, probs = seq(0,1, by = 0.25))

bei.slope.cut <- cut(bei.slope, breaks = b, labels = 1:4)

bei.slope.tess <- tess(image = bei.slope.cut)

plot(bei.slope.tess, valuesAreColours=FALSE)

plot(bei.ppp, add = TRUE, pch = "+")

Now we can use the quadratcount() function, but use the tessellation, rather than a set number of grid boxes.

qb <- quadratcount(bei.ppp, tess = bei.slope.tess)

plot(qb, valuesAreColours=FALSE)

Finally, we can test for to see if the objects are preferentially clustered in the slope classes using the quadrat.test() function. Again, we test against a null hypothesis that the distribution should be equal across all classes:

bei.qt <- quadrat.test(bei.ppp, tess = bei.slope.tess)

bei.qt
## 
##  Chi-squared test of CSR using quadrat counts
## 
## data:  bei.ppp
## X2 = 661.84, df = 3, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 
## Quadrats: 4 tiles (levels of a pixel image)
plot(bei.qt, cex = 0.8, valuesAreColours = FALSE)

The low \(p\)-value again suggests that we can reject the null hypothesis and state that the trees are not uniformly distributed across the slope classes.

Kernel density functions

Variations in the intensity of a spatial point process may also be investigated using a kernel density method. This fits two-dimensional Gaussian kernels (or windows) to the point objects, and effectively sums them across the area. Areas with higher densities of objects will therefore have a higher sum. These density functions provide a useful summary of variations in intensity and a good visualization method to examine a dataset for random or non-random distributions.

The densities are calculated using the density() function, which adapts to a ppp object. The most important parameter is sigma, which controls the bandwidth or size of the window fit to each point.

plot(density(bei.ppp, sigma = 60))

  • Re-run this command with different values of sigma to see the effect of changing this parameter

The bandwidth can be selected using cross validation. This can be done in a two step process by (1) selecting the bandwidth using bw.diggle, then using this in the density function. Note that this tends to give very conservative estimates of bandwidth:

bei.bw <- bw.diggle(bei.ppp)

plot(density(bei.ppp, sigma = bei.bw))

Distance functions

Distance functions can be used to investigate the interaction between points in space. Various methods exist, all based on the idea of calculating distances between points and other points, or fixed points in the study region. The most well-known of these is Ripley’s \(K\) function, which describes the distribution as the set of all pairwise distances between points.

We will run this using a different point data set, the redwoods data: redwood.shp. Read this into R, and create a point process object. As the data is in a shape file, we will need to use the st_read() function from the sf package. This package also includes a helper function as.ppp() to convert directly to a ppp object.

library(sf)

redwood.sf <- st_read("../datafiles/redwood/redwood.shp", quiet = TRUE)

redwood.ppp <- as.ppp(redwood.sf)
## Warning in as.ppp.sf(redwood.sf): only first attribute column is used for marks
redwood.ppp
## Marked planar point pattern: 62 points
## marks are numeric, of storage type  'double'
## window: rectangle = [0.1, 0.999] x [0.04, 0.92] units

The function has created a ppp object, but by default it uses the first column in the sf data frame as a mark, a label on each point. We’ll look at this later in the lab, but for now we want to ignore this by setting it to a NULL value. The other thing we’ll correct is the window size, setting the minimum and maximum limits to 0 and 1 for both the \(x\) and \(y\) axes.

marks(redwood.ppp) <- NULL

Window(redwood.ppp) <- owin(x = c(0, 1), y = c(0, 1))

plot(redwood.ppp)

Ripley’s \(K\)

Ripley’s \(K\) function is calculated using the Kest() function and similar functions exist for the \(F\) and \(G\) functions. Once calculated, we can plot out the results, including the observed values of Ripley’s \(K\) and a theoretical curve based on an homogenous poisson process with an intensity equal to our point process object.

redwood.kest <- Kest(redwood.ppp)

plot(redwood.kest)

If the point process data is effectively random (i.e. follows a poisson distribution), we would expect the observed line (black) to fall on top of or close to the theoretical line (blue). Above the theoretical line indicates clustering; below indicates a regular or ordered distribution. The green and red line represent \(K\) values calculated with different corrections for the border effect.

The redwood data appear to be clustered. To test if these are significantly different from a random distribution, we can run a Monte Carlo series of random simulations of homogenous poisson processes, using the function envelope(). This gives us an envelope of possible values of Ripley’s \(K\), which account for simple stochastic differences in random distributions. If the data are really clustered, we expect the observed Ripley’s \(K\) to fall outside this envelope. These random simulations are performed using the envelope() function, which requires:

  1. A point process object
  2. The function to be used (here Ripley’s \(K\); Kest)
  3. The number of random simulations to be performed (99)
redwood.kest.mc <- envelope(redwood.ppp, 
                            fun = 'Kest', 
                            nsim = 99, 
                            verbose = FALSE)

plot(redwood.kest.mc, shade = c("hi", "lo"))

Note that this uses point-wise estimates of uncertainty, and cannot be used as a post-hoc test. A better approach is to calculate the global uncertainty as the largest deviation between the randomly generated values of \(K\) and the theoretical value:

redwood.kest.mc <- envelope(redwood.ppp, 
                            fun = 'Kest', 
                            nsim = 99, 
                            verbose = FALSE, 
                            global = TRUE)

plot(redwood.kest.mc, shade = c("hi", "lo"))

Besag’s \(L\)

The \(L\)-function was proposed by Besag as a way to stabilize the variance of Ripley’s \(K\) and improve the interpretation. We can calculate this by simply replacing the function name in the envelope() function, as follows:

redwood.lest.mc <- envelope(redwood.ppp, 
                            fun = 'Lest', 
                            nsim = 99, 
                            verbose = FALSE, 
                            global = TRUE)

plot(redwood.lest.mc, shade = c("hi", "lo"))

Pair correlation function

The final function we will look at here is the pair correlation function. Instead of using cumulative pairs of distances, this is based on the number of pairs of points seperated by a band of distances. This has the advantage of providing a clearer idea of the range of interactions - as Ripley’s \(K\) is based on the cumulative set of distances, this can make it seem as though interactions are present over greater distances than they really are. Again, we can use the envelope() function, but this time we remove the global option by setting it to false, as this is no longer needed.

redwood.pcf.mc <- envelope(redwood.ppp, 
                           fun = 'pcf', 
                           nsim = 99, 
                           verbose = FALSE)

plot(redwood.pcf.mc, shade=c("hi", "lo"), ylim = c(0,5))

This shows positive interactions up to a range of about 0.07 map units, much less than in the corresponding \(K\)-function.

Marked point processes

In the previous sections, the point processes have been considered as single objects. Marked point process data include some information that distinguish the objects into different classes, allowing study of the co-occurrence (either positive or negative) between different classes of object. The Lansing forest data set contains the location of a set of trees in a forest in Michigan. Read this file in and take a look at the structure of the data, and you will see there is a column defining the species name of each tree.

lansing <- read.csv("../datafiles/lansing/lansing.csv")

str(lansing)
## 'data.frame':    2250 obs. of  3 variables:
##  $ x      : num  0.078 0.076 0.051 0.015 0.03 0.102 0.135 0.121 0.04 0.065 ...
##  $ y      : num  0.091 0.266 0.225 0.366 0.426 0.474 0.498 0.489 0.596 0.608 ...
##  $ species: chr  "blackoak" "blackoak" "blackoak" "blackoak" ...

We’ll now create a ppp object using the species to defined the marks or the labels of each point. Some differences from before: (1) the window describing the study area is in a shape file (lansing.shp and will need to be read in and converted to an owin object; (2) we need to specify the class information (the species names) when creating the ppp object, using the marks parameter. Note that we first need to convert this column to a factor so that R will recognize it as labels.

lansing$species <- as.factor(lansing$species)

lansing.win <- st_read("../datafiles/lansing/lansing.shp", quiet = TRUE)

lansing.ppp <- ppp(x = lansing$x, 
                   y = lansing$y, 
                   win = lansing.win, 
                   marks = lansing$species)
plot(lansing.ppp)

The plot shows all the different marks (species) plotted together. We can access different marks, using the split() function, and can use this to analyze the distribution of any single species:

plot(split(lansing.ppp), main = "All marks")

plot(split(lansing.ppp)$maple, "Maple trees")

While these plots show some clear evidence of species that have quite different distributions, the density() function can be used to make this clearer:

plot(density(split(lansing.ppp)), main = "Lansing density surfaces")

To examine the co-occurrence of two marks or species, we can again use Ripley’s \(K\) function. We use the cross version: Kcross(), which examines pairwise differences between objects from the two classes. Again, we use the envelope() function to simulate random distributions, and specify the two species (marks) as \(i\) and \(j\) in the function:

lansing.kc <- envelope(lansing.ppp, 
                       Kcross, 
                       i = "blackoak", 
                       j = "hickory",
                       nsim = 99, 
                       verbose = FALSE)

As before, we can plot the output as observed curves and the envelope of simulated random distributions. If the observed curve is above the envelope, this is evidence that the two species co-occur; if below then the two species tend to occur in different areas, suggesting some competitive interaction. If the observed curve is within the envelope, then the combined distribution is random.

plot(lansing.kc)

Try this again comparing species which have very different distributions, e.g. black oak and maple species.

Point process models

Point process models can be fit to any ppp object using the ppm() function. This uses the set of observed points to model the variations in intensity of a point process, usually based on some set of covariates. This function takes as its first argument, a ppp object, and a set of covariates as the second argument. Note that the second argument is the same syntax as the right hand side of a linear model in R. We start by building a simple model of a homogeneous Poisson process (i.e. with no covariates):

fit0 <- ppm(bei.ppp ~ 1)

fit0
## Stationary Poisson process
## Intensity: 0.007208
##              Estimate       S.E.   CI95.lo   CI95.hi Ztest      Zval
## log(lambda) -4.932564 0.01665742 -4.965212 -4.899916   *** -296.1182

This model returns a single parameter; the mean intensity for the region. As this is in log units, we can back convert as follows:

exp(coef(fit0))
## log(lambda) 
##    0.007208

Telling us there is about 0.007th of a tree in each square meter. To check this is right, let’s get the intensity directly from the bei.ppp object:

summary(bei.ppp)
## Planar point pattern:  3604 points
## Average intensity 0.007208 points per square unit
## 
## Coordinates are given to 1 decimal place
## i.e. rounded to the nearest multiple of 0.1 units
## 
## Window: rectangle = [0, 1000] x [0, 500] units
## Window area = 5e+05 square units

The following example models intensity as a second order polynomial function of the x and y coordinates of the objects. The polynom() function expands a set of variables into their second (or \(n^{th}\)) order form (i.e. \(x + y + x^2 + y^2 + x*y\) for second order coordinates).

fit1 <- ppm(bei.ppp ~ polynom(x, y, 2))

fit1
## Nonstationary Poisson process
## 
## Log intensity:  ~x + y + I(x^2) + I(x * y) + I(y^2)
## 
## Fitted trend coefficients:
##   (Intercept)             x             y        I(x^2)      I(x * y) 
## -4.275762e+00 -1.609187e-03 -4.895166e-03  1.625968e-06 -2.836387e-06 
##        I(y^2) 
##  1.331331e-05 
## 
##                  Estimate         S.E.       CI95.lo       CI95.hi Ztest
## (Intercept) -4.275762e+00 7.811138e-02 -4.428857e+00 -4.122666e+00   ***
## x           -1.609187e-03 2.440907e-04 -2.087596e-03 -1.130778e-03   ***
## y           -4.895166e-03 4.838993e-04 -5.843591e-03 -3.946741e-03   ***
## I(x^2)       1.625968e-06 2.197200e-07  1.195325e-06  2.056611e-06   ***
## I(x * y)    -2.836387e-06 3.511163e-07 -3.524562e-06 -2.148212e-06   ***
## I(y^2)       1.331331e-05 8.487506e-07  1.164979e-05  1.497683e-05   ***
##                   Zval
## (Intercept) -54.739290
## x            -6.592577
## y           -10.116084
## I(x^2)        7.400185
## I(x * y)     -8.078197
## I(y^2)       15.685769

And we can plot the fitted trend surface:

plot(fit1, 
     how = 'image', 
     se = FALSE, 
     pause = FALSE)

Earlier, we saw a relationship between tree location and slope. We can use the same function (ppm()) to model the intensity of the distribution using slope values. For a point process model, it is important to have values of the covariate at locations away from the points, as well as at the, Here, we use the slope image (bei.slope), specified using the usual R model syntax.

fit2 <- ppm(bei.ppp ~ bei.slope)

fit2
## Nonstationary Poisson process
## 
## Log intensity:  ~bei.slope
## 
## Fitted trend coefficients:
## (Intercept)   bei.slope 
##   -5.391053    5.026710 
## 
##              Estimate       S.E.   CI95.lo   CI95.hi Ztest      Zval
## (Intercept) -5.391053 0.03001787 -5.449887 -5.332219   *** -179.5948
## bei.slope    5.026710 0.24534296  4.545847  5.507573   ***   20.4885

The coefficient for the slope is about 5. Remember for regression models based on the log of the response variable, this is a multiplier, and reflects the increase in the intensity with each unit increase in slope. We can again plot the fitted trend surface:

plot(fit2, 
     how = 'image', 
     se = FALSE, 
     pause = FALSE)

Exercise

  1. The zip file ‘urkiola.zip’ contains two shapefiles. The first urkiola.shp contains point locations of two species of tree (oak and birch) in a Spanish National Park. The second, urkiolaWindow.shp, contains the park boundary as a polygon. Code for reading in these files, and converting them into a point process object (ppp) is given below. Once you have obtained the ppp object:
    • Give the intensity (spatial density) of the combined set of two tree species by using the summary() function
    • Plot the distribution of the two tree species as separate figures. You will need to use the split() function to access the data by individual species (e.g. split(urkiola.ppp)$oak will return only the location of the oak trees)
    • Now make plots of the kernel density of each species, and compare these plots. Do they suggest that the two species co-occur within the park?
    • Using Ripley’s \(K\) function, examine the individual distributions of the two species for non-random spatial organization. Use the envelope() function to test the obtained Ripley’s \(K\) against a hypothesis of Complete Spatial Randomness. Plot the results and state whether or not each species has a random distribution or not, and if not, whether the trees are clustered or have a regular distribution. Note that using summary() on the output of the envelope() function with give the significance level of the Monte Carlo test
    • Finally, test the spatial dependence between the two distributions to test if the two species truly co-occur or not. Use the envelope() function again, but with the Kcross() function to estimate the cross-dependence. Plot the results and state whether the distributions of the two species show positive, negative or no correlation with each other

R code for reading Urkiola shape file and converting to ppp object

# Tree locations
urkiola.sf <- st_read("../datafiles/urkiola/urkiola.shp")
# Park boundary
urkiola.win <- st_read("../datafiles/urkiola/urkiolaWindow.shp")
urkiola.win <- as.owin(urkiola.win)
# First extract coordinates
urkiola.crds <- st_coordinates(urkiola.sf)
## Convert to ppp
urkiola.ppp <- ppp(x = urkiola.crds[, 1], y = urkiola.crds[,2], 
                   marks = as.factor(urkiola.sf$tree), window = urkiola.win)
plot(urkiola.ppp)